Toward a Definitive Compressibility Measure for Repetitive Sequences

Abstract

While the $k$th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of $k$ and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size $z$ of the Lempel-Ziv parse are frequently used to estimate repetitiveness. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and is not monotone upon appending symbols. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma \le b$ lower-bounds all the previous relevant ones, while length-$n$ strings can be represented and efficiently indexed within space $O\left(\gamma \log \frac{n}{\gamma}\right)$, which also upper-bounds many measures, including $z$. Although $\gamma$ is arguably a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in $o(\gamma \log n)$ space. In this paper, we study an even smaller measure, $\delta \le \gamma$, which can be computed in linear time, is monotone, and allows encoding every string in $O\left(\delta \log \frac{n}{\delta}\right)$ space because $z = O\left(\delta \log \frac{n}{\delta}\right)$. We argue that $\delta$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $\delta$ can be strictly smaller than $\gamma$, by up to a logarithmic factor; (2) there are string families needing $\Omega\left(\delta \log \frac{n}{\delta}\right)$ space to be encoded, so this space is optimal for every $n$ and $\delta$; (3) one can build run-length context-free grammars of size $O\left(\delta \log \frac{n}{\delta}\right)$, whereas the smallest (non-run-length) grammar can be up to $\Theta(\log n/\log \log n)$ times larger; and (4) within $O\left(\delta \log \frac{n}{\delta}\right)$ space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size $\sigma$ of the string, showing that $\Theta\left(\delta \log \frac{n \log \sigma}{\delta \log n}\right)$ space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
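
The abstract does not spell out how $\delta$ is defined. As an assumption taken from the usual definition of substring complexity, the sketch below uses $\delta = \max_{k \ge 1} d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of the string. It is a naive quadratic-time illustration in Python, not the linear-time algorithm the abstract refers to, and the function names `substring_complexity`, `delta`, and `lz_phrases` are hypothetical. A simple greedy Lempel-Ziv parse is included only to compare $\delta$ with $z$ on small inputs.

```python
from fractions import Fraction


def substring_complexity(s: str) -> list[int]:
    """Return [d_1, ..., d_n], where d_k counts distinct length-k substrings of s."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s: str) -> Fraction:
    """Assumed definition: delta = max over k >= 1 of d_k / k (0 for the empty string)."""
    ratios = (Fraction(d, k) for k, d in enumerate(substring_complexity(s), start=1))
    return max(ratios, default=Fraction(0))


def lz_phrases(s: str) -> int:
    """Count phrases of a greedy left-to-right Lempel-Ziv parse.

    Each phrase is the longest prefix of the remaining text that also starts
    at an earlier position (the copy may overlap the phrase itself), or a
    single fresh symbol when no such prefix exists.
    """
    i, z, n = 0, 0, len(s)
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)
        z += 1
    return z


if __name__ == "__main__":
    for text in ["aaaa", "abab", "abracadabra"]:
        print(text, "delta =", delta(text), "z =", lz_phrases(text))
```

On these small inputs the computed $\delta$ never exceeds $z$, which is consistent with the chain $\delta \le \gamma \le b \le z$ stated in the abstract. Exact rational arithmetic via `Fraction` avoids floating-point ties when taking the maximum.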

Journal

Journal title: IEEE Transactions on Information Theory

Year: 2023

ISSN: ['0018-9448', '1557-9654']

DOI: https://doi.org/10.1109/tit.2022.3224382